Abstract


Dataflow graphs are described as a machine language for parallel machines. Static and dynamic
dataflow  architectures  are  presented  as  two  implementations  of  the  abstract  dataflow  model.  Static 
dataflow  allows  at  most  one  token  per  arc  in  dataflow  graphs  and  thus  only  approximates  the 
abstract  model  where  unbounded  token  storage  per  arc  is  assumed.  Dynamic  architectures  tag  each 
token  and  keep  them  in  a  common  pool  of  storage,  thus  permitting  a  better  approximation  of  the 
abstract  model.  The  relative  merits  of  the  two  approaches  are  discussed.  Functional  data  structures 
and  I-structures  are  presented  as  two  views  of  data  structures  which  are  both  compatible  with  the 
dataflow  model.  These  views  are  contrasted  and  compared  in  regard  to  efficiency  and  exploitation 
of  potential  parallelism  in  programs.  A  discussion  of  major  dataflow  projects  and  a  prognosis  for 
dataflow architectures are also presented.
Dataflow Architectures


1.  Dataflow  Model 

The dataflow model of computation offers a simple, yet powerful, formalism for describing
parallel  computation.  However,  a  number  of  subtle  issues  arise  in  developing  a  practical  computer 
based  on  this  model,  and  dataflow  architectures  exhibit  substantial  variation,  reflecting  different 
standpoints  taken  on  certain  aspects  of  the  model.  For  example,  in  the  abstract  dataflow  model 
data  values  are  carried  on  tokens  which  travel  along  the  arcs  connecting  various  instructions  in  the 
program  graph,  and  it  is  assumed  that  the  arcs  are  First-in-First-out  (FIFO)  queues  of  unbounded 
capacity  [36].  This  gives  rise  to  two  serious,  pragmatic  concerns:  (1)  How  should  the  tokens  on  arcs 
be  managed?  (2)  How  should  data  structures,  which  are  essentially  composites  of  many  tokens,  be 
represented? The manner in which these concerns are resolved has major impact, not only on the
machine  organization,  but  also  on  the  amount  of  parallelism  that  can  be  exploited  in  programs.  In 
this  paper,  we  examine  the  major  variations  in  dataflow  architectures  with  regard  to  token  storage 
mechanisms  and  data  structure  storage. 

The  paper  is  organized  as  follows.  The  rest  of  Section  1  introduces  dataflow  program  graphs  and 
the  rules  which  determine  when  and  how  operations  are  performed.  Also,  it  explains  why  data 
structures  can  not  be  viewed  as  they  are  in  conventional  programming  languages  without  seriously 
compromising  the  suitability  of  the  dataflow  approach  for  parallel  processing.  Section  2  examines 
the two token storage mechanisms adopted in current dataflow architectures. The static dataflow
approach places the restriction that at most one token can reside on an arc at any time, while the
tagged-token dataflow approach allows essentially unbounded queues on the arcs with no ordering,
but with each token carrying a tag to identify its role in the computation. Section 3 presents two
alternatives to the view of data structures embodied in conventional languages. The first alternative
treats a data structure as a value which is, conceptually, carried on a token. "Functional" structure
operations,  such  as  cons,  are  provided  to  create  new  structures  out  of  old  ones.  This  approach  is 
elegant,  but  expensive  to  implement  (even  if  the  data  structure  is  actually  left  behind  in  storage  so 
the  token  carries  only  a  pointer)  and  restricts  parallelism.  The  second  alternative  treats  a  data 
structure  as  a  collection  of  slots,  each  of  which  can  be  written  only  once.  Any  attempt  to  read  a  slot 
before  it  is  written  is  deferred  until  the  corresponding  write  occurs.  Section  4  gives  an  overview  of 
the  major  dataflow  projects.  Finally,  Section  5  gives  our  views  of  what  the  future  holds  for  dataflow 
computers. 

1.1. Acyclic, Conditional, and Loop Program Graphs

A dataflow program is described by a directed graph where the nodes denote operations, e.g.,
addition and multiplication, and the arcs denote data dependencies between operations [22]. As an
example, Figure 1 shows the acyclic dataflow program graph for the following expression.

let x = a * b;
    y = 4 * c
in (x + y) * (x - y) / c




Any arithmetic or logical expression can be translated into an acyclic dataflow graph in a
straightforward manner. Data values are carried on tokens which flow along the arcs. A node may execute
(or  fire)  when  a  token  is  available  on  each  input  arc.  When  it  fires,  a  data  token  is  removed  from 
each  input  arc,  a  result  is  computed  using  these  data  values,  and  a  token  containing  the  result  is 
produced  on  each  output  arc. 
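
The firing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression is taken to be x = a*b, y = 4*c, (x+y)*(x-y)/c; the node names other than s1 and s2, the arc names, and the function run are hypothetical; and each arc is modeled as an unbounded FIFO queue, as in the abstract model.

```python
from collections import deque
import operator

# Hypothetical encoding of the Figure 1 graph:
# node name -> (function, input arcs, output arcs).
NODES = {
    "s1":  (operator.mul,     ["a", "b"],         ["x_add", "x_sub"]),  # x = a * b
    "s2":  (lambda c: 4 * c,  ["c_s2"],           ["y_add", "y_sub"]),  # y = 4 * c (assumed constant)
    "add": (operator.add,     ["x_add", "y_add"], ["sum"]),
    "sub": (operator.sub,     ["x_sub", "y_sub"], ["diff"]),
    "mul": (operator.mul,     ["sum", "diff"],    ["prod"]),
    "div": (operator.truediv, ["prod", "c_div"],  ["out"]),
}

def run(waves, order=None):
    """Push waves of (a, b, c) tokens through the graph; return the output tokens."""
    order = list(order or NODES)
    arcs = {arc: deque() for _, ins, outs in NODES.values() for arc in ins + outs}
    for a, b, c in waves:
        arcs["a"].append(a)
        arcs["b"].append(b)
        arcs["c_s2"].append(c)    # c fans out to s2 ...
        arcs["c_div"].append(c)   # ... and to the divide node
    fired = True
    while fired:
        fired = False
        for name in order:
            fn, ins, outs = NODES[name]
            if all(arcs[i] for i in ins):                  # a token on every input arc
                args = [arcs[i].popleft() for i in ins]    # remove one token per input arc
                result = fn(*args)                         # compute the result
                for o in outs:                             # produce a token on each output arc
                    arcs[o].append(result)
                fired = True
    return list(arcs["out"])
```

Calling run with the same waves but a different scheduling order, for example order=list(reversed(list(NODES))), gives the same output list, which is the determinacy property discussed next.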


Figure 1: Acyclic Dataflow Graph

Nodes s1 and s2 in Figure 1 are both enabled for execution as soon as tokens are placed on the
input arcs a, b and c. They may fire simultaneously, or one may fire before the other; the results are
the same in either case. The result of an operation is purely a function of the input values; there are
no implicit interactions between nodes via side-effects, say through shared memory. This example
illustrates two key properties of the dataflow approach: (1) parallelism, i.e., nodes may potentially
execute in parallel unless there is an explicit data dependence between them, and (2) determinacy,
i.e., results do not depend on the relative order in which potentially parallel nodes execute.
Further, notice that by supplying several sets of input tokens, distinct computations can be
pipelined through the graph. In this example, a single wave of tokens on the input arcs produces a
single wave of tokens on the output arcs. Graphs which have this property are called well-behaved.
All acyclic graphs for arithmetic and logical expressions are well-behaved.

The unbounded FIFO queue model presented in this paper is a generalization of the dataflow model originally
formulated by Dennis. His model [22] requires that the output arcs of a node be empty before it fires, implying that at
most one token can reside on any arc. Kahn's paper [36] implies that the determinacy of dataflow graphs is preserved
even without this restriction. Kahn's result also permits nodes to have internal state, but we do not consider this
generalization.

In order to build conditional and loop program graphs, we introduce two control operators: switch
and merge. Unlike the arithmetic operators, switch and merge are not well-behaved in isolation, but yield
well-behaved graphs when used in conditional and loop schemas [24]. Consider first the
conditional graph in Figure 2.a which represents the expression if x<y then x+y else x-y. The
initial tokens provide the data input to the switches as well as input to the predicate graph. The
predicate graph yields a single boolean value which supplies the control input to all the switches and
merges.  A  switch  routes  its  data  input  to  the  output  arc  on  the  True  side  or  False  side,  according  to 
the  value  of  the  control  input.  Thus,  the  wave  of  input  tokens  is  directed  to  the  True  or  the  False 
arm  of  the  conditional.  As  long  as  the  arms  of  the  conditional  are  well-behaved  graphs,  a  single 
wave  of  tokens  will  eventually  arrive  at  the  data  input  of  the  appropriate  side  of  the  merge.  The 
merge  selects  an  input  token  from  the  True  or  the  False  side  input  arc,  according  to  the  value  of  the 
control  input,  and  reproduces  the  data  input  token  on  the  output  arc.  To  see  that  the  conditional 
behaves  appropriately  when  waves  of  inputs  are  presented  to  it,  consider  the  tricky  case  in  which 
the  first  wave  of  input  tokens  is  switched  to  the  True  side,  the  second  wave  to  the  False  side  and  the 
tokens  on  the  False  side  of  the  merge  arrive  before  the  tokens  on  the  True  side.  The  sequence  of 
control  tokens  at  the  merge  restores  the  proper  order  among  the  tokens  on  the  output  arcs. 
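
The tricky case just described can be traced with a small Python sketch. It is a simplification under stated assumptions: a single switch routes the pair (x, y) rather than one switch per variable, the two arms are computed inline, and the names switch, merge_enabled, merge and conditional_demo are hypothetical. The schedule deliberately lets the False arm finish before the True arm.

```python
from collections import deque

def switch(control, value, true_arc, false_arc):
    """Route one data token to the True or the False output arc."""
    (true_arc if control else false_arc).append(value)

def merge_enabled(control_arc, true_arc, false_arc):
    """A merge may fire only if the side named by the next control token holds a token."""
    if not control_arc:
        return False
    return bool(true_arc if control_arc[0] else false_arc)

def merge(control_arc, true_arc, false_arc, out_arc):
    """Consume one control token and one data token from the side it selects."""
    side = true_arc if control_arc.popleft() else false_arc
    out_arc.append(side.popleft())

def conditional_demo(waves):
    """Evaluate 'if x<y then x+y else x-y' on waves of (x, y), False arm first."""
    merge_ctl = deque()
    arm_true, arm_false = deque(), deque()        # inputs to the two arms
    m_true, m_false, out = deque(), deque(), deque()
    for x, y in waves:
        p = x < y                                  # predicate graph
        merge_ctl.append(p)                        # control token for the merge
        switch(p, (x, y), arm_true, arm_false)
    while arm_false:                               # False arm runs to completion first
        x, y = arm_false.popleft()
        m_false.append(x - y)
    fired_early = merge_enabled(merge_ctl, m_true, m_false)
    while arm_true:                                # True arm finishes later
        x, y = arm_true.popleft()
        m_true.append(x + y)
    while merge_enabled(merge_ctl, m_true, m_false):
        merge(merge_ctl, m_true, m_false, out)
    return fired_early, list(out)
```

For the waves [(1, 2), (5, 3)], the merge stays disabled while only the False-side token is present, and the outputs then appear in wave order.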

The loop graph shown in Figure 2.b computes $\sum_{i=1}^{N} F(i)$. The figure is somewhat stylized in that the
dots are used to indicate that the output of the predicate is connected to each of the switches and
merges, and the graph corresponding to function F is indicated by the "blob" containing F. The
initial values of i and sum enter the loop from the False sides of the merges, and provide data to the
predicate and switches. If the predicate evaluates to True, the data values are routed to the loop
body. Assuming the body is a well-behaved graph, eventually a single wave of results is produced,
providing tokens on the True side of the merges. In this way, values circulate through the loop until
the predicate turns to False, which causes the final values to be routed out of the loop and restores
the initial False values on the control inputs to the merges. Note that if many waves of inputs are
provided, only one wave is allowed to enter the loop at a time: the second wave enters the loop as
soon as the first completes, and so on. Also note that loop values need not circulate in clearly
defined waves. Suppose F is a very complicated graph, or simply does not fire for a long time. The
index variable i may continue to circulate, causing many computations of F to be initiated. This
behavior is informally referred to as dynamic unfolding of a loop.

1.2.  Data  Structures 

The dataflow model introduced thus far is fully general in a formal computational sense [34], but
has limited practical utility because of the absence of data structures. Suppose we introduce a data
structure constructor cons which "glues together" two data values, producing a pair, and selectors
first and rest which access the components of a pair. Since these new operators are functions, they
fit easily in the dataflow model, provided we assume tokens can carry composite data values. Note
that a component of the pair might be a pair, and so on; thus we must allow arbitrarily large
structures to be carried on a token. Only in the abstract model do we think of structures as being
carried on tokens; in practice tokens carry pointers to structures which are left behind in storage.
The cons operation can be extended to a general array operation append which takes an array x, an
index i, and an element v, and produces a new array y such that y[j], i.e., the j-th element of y, is the
same as x[j] for all j not equal to i, and such that y[i] is v.

Figure 2: Conditional and Loop Graphs

Even though data structures sit aside in storage, we must be careful not to treat them as we do
arrays or records in a conventional language such as Pascal or Fortran. Consider the effect of a
conventional store operation which modifies an element of a data structure. In general there may
be many tokens carrying pointers to the structure. Suppose one is destined for a modify operation
and another is destined for a select operation with the same index. The two operations can
potentially execute in parallel because there is no explicit data dependency from one to the other.
However, the value produced by the select operation depends upon which operation happens to
execute first. This defeats the determinacy of the model; it is no longer true that instructions can
execute in any order consistent with the data dependencies and the results remain unaffected by the
order. Append, however, does not change the data structure; it produces a new structure that is
similar to the old one. Consider the earlier scenario, in which a token is destined for a select and
another carrying a pointer to the same structure is destined for an append; the select operates on the
old structure and hence is not affected by the append.
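
The contrast between a conventional store and append can be made concrete with a short Python sketch. The helper names (store, append, select, run_with_store, run_with_append) and the use of a list for the mutable structure and a tuple for the functional one are illustrative assumptions, not part of the model.

```python
def store(x, i, v):
    """Conventional update: modifies the shared structure in place."""
    x[i] = v

def append(x, i, v):
    """Functional update: returns a new y with y[i] == v and y[j] == x[j] for j != i."""
    return x[:i] + (v,) + x[i + 1:]

def select(x, i):
    return x[i]

def run_with_store(order):
    """Two tokens point at the same list; 'order' is the order in which they fire."""
    x = [10, 20, 30]
    seen = None
    for op in order:
        if op == "modify":
            store(x, 1, 99)
        else:
            seen = select(x, 1)
    return seen

def run_with_append(order):
    """Same scenario, but the modify is an append that leaves x untouched."""
    x = (10, 20, 30)
    seen = y = None
    for op in order:
        if op == "modify":
            y = append(x, 1, 99)
        else:
            seen = select(x, 1)
    return seen, y
```

With store, the value seen by select depends on the firing order (20 or 99); with append, both orders yield the same pair of results.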

These  observations  raise  a  tough  question.  Is  it  possible  to  support  data  structures  efficiently  and 
still  maintain  the  elegance  and  simplicity  of  the  dataflow  model?  We  return  to  this  question  in 
Section  3. 

1.3.  User-defined  Functions 

Another highly desirable property of a model of computation is the ability to support user-defined
functions. Each of our examples represents a function which, given a set of input values,
produces  a  set  of  results.  Any  good  high-level  language  provides  a  way  of  abstracting  variables  so 
that  an  expression  can  be  turned  into  a  procedure  or  a  function.  At  the  dataflow  graph  level,  a 
user-defined  function  is  no  more  than  an  encapsulation  of  a  graph  which  allows  arguments  and 
results  to  be  transmitted  properly.  Non-recursive  functions  can  be  handled  by  graph  expansion  at 
compile  time.  However,  to  support  user-defined  functions  more  generally,  we  need  an  apply 
operator which takes as inputs a function-value (i.e., a description of an encapsulated dataflow
graph)  and  a  set  of  arguments,  and  invokes  the  function  on  the  specified  arguments.  There  are 
subtle  issues  involved  in  the  implementation  of  apply.  For  example,  when  should  the  graph 
corresponding  to  the  function  actually  be  created?  After  all  the  arguments  have  arrived?  As  soon 
as a particular argument has arrived? Often the semantics of function application in high-level
languages requires the apply to be implemented in a particular way. However, all implementations
must support dynamic expansion of graphs and a method to route tokens to input arcs of the newly
created  graph.  If  a  copy  of  the  function  graph  is  to  be  reused,  then  a  mechanism  is  required  to 
distinguish  tokens  belonging  to  different  invocations.  In  this  latter  case  the  FIFO  queueing  of 
tokens  on  arcs  will  not  suffice.  A  mechanism  for  user-defined  functions  develops  naturally  out  of 
the  tagged-token  approach,  so  we  will  return  to  this  topic  after  discussing  various  implementations. 

1.4. Dataflow Graphs as a Parallel Machine Language

We can view dataflow graphs as a machine language for a parallel machine where a node in a
dataflow graph represents a machine instruction. The instruction format for a dataflow machine is
essentially an adjacency list representation of the program graph: each instruction contains an
op-code and a list of destination instruction addresses. Recall, an instruction or node may execute
whenever a token is available on each of its input arcs, and when it fires the input tokens are
consumed, a result value is computed, and a result token is produced on each output arc. This
dictates the following basic instruction cycle: (1) detect when an operation is enabled (this is
tantamount to collecting operand values), (2) determine the operation to be performed, i.e., fetch
the instruction, (3) compute results, and (4) generate result tokens. This is the basic instruction
cycle  of  any  dataflow  machine;  however,  there  remains  tremendous  flexibility  in  the  details  of  how 
this  cycle  is  performed. 


It  is  interesting  to  contrast  dataflow  instructions  with  those  of  conventional  machines.  In  a  von 
Neumann  machine,  instructions  specify  the  addresses  of  the  operands  explicitly  and  the  next 
instruction  implicitly  via  the  program  counter  (except  for  branch  instructions).  In  a  dataflow 
machine, operands (tokens) carry the address of the instruction for which they are destined, and
instructions  contain  the  addresses  of  the  destination  instructions.  Since  the  execution  of  an 
instruction  is  dependent  upon  the  arrival  of  operands,  the  management  of  token  storage  and 
instruction  scheduling  are  intimately  related  in  any  dataflow  computer. 

Dataflow graphs exhibit two kinds of parallelism in instruction execution. The first we might call
spatial parallelism: any two nodes can potentially execute concurrently if there is no data
dependence between them. The second form of parallelism results from pipelining independent
waves of computation through the graph. In the next section we show that it is possible to execute
several  instances  of  the  same  node  concurrently,  thereby  exploiting  this  temporal  parallelism. 

2.  Token  Storage  Mechanisms 

The  essential  point  to  keep  in  mind  in  considering  ways  to  implement  the  dataflow  model  is  that 
tokens  imply  storage.  The  token  storage  mechanism  is  the  key  feature  of  a  dataflow  architecture. 
While  the  dataflow  model  assumes  unbounded  FIFO  queues  on  the  arcs  and  FIFO  behavior  at  the 
nodes, it turns out to be very difficult to implement this model exactly. Two alternative approaches
have been researched extensively. The first we call static dataflow; it provides a fixed amount of
storage per arc. The other approach we call dynamic or tagged-token dataflow; it provides dynamic
allocation  of  token  storage  out  of  a  common  pool  and  assumes  that  tokens  carry  tags  to  indicate 
their  logical  position  on  the  arcs. 

2.1.  Static  Dataflow  Machine 

The one-token-per-arc restriction can be incorporated in the model by extending the firing rule to
require that all output arcs of a node be empty before that node is enabled. With this restriction,
storage for tokens can be allocated prior to execution, since the number of arcs is fixed for a given
graph. The basic instruction format is expanded to include a slot for each operand. Distributing
tokens to destination instructions involves little more than storing data values in the appropriate
slots. The slots have presence flags to indicate whether or not a value has been stored. Thus, when a
token is stored, it is straightforward to determine if the other inputs are all present. This idea
underlies the static dataflow machines proposed by Dennis and his co-workers [21, 23, 25] (see
Figure  3). 

Instruction templates reside in the activity store and addresses of enabled instructions reside in the
instruction queue. The fetch unit removes the first entry in the instruction queue, fetches the
corresponding op-code, data, and destination list from the activity store, forms them into an
operation packet, forwards the operation packet to an available operation unit, and finally clears the
operand slots in the template. The operation unit computes a result, generates a result packet for
each destination, and sends the result packets to the update unit. Instructions are identified by their
address in the activity store, so the update unit stores each result and checks the presence bits to
determine if the corresponding activity is enabled. If so, the address of the instruction is placed in
the instruction queue. These units operate concurrently, so instructions are processed in a pipelined
fashion.

Figure 3: Static Dataflow Architecture
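
The interplay of the activity store, instruction queue, and the fetch, operation, and update units can be sketched as follows. This is a sequential toy model under stated assumptions, not the pipelined hardware: the class and function names (Template, StaticPE, make_template), the EMPTY sentinel used in place of presence flags, the three-entry opcode table, and the convention that a destination address of None means "program output" are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
import operator

EMPTY = object()   # sentinel: operand slot whose presence flag is clear

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

@dataclass
class Template:
    opcode: str
    dests: list    # (instruction address, operand slot) pairs
    slots: list    # operand slots; EMPTY means no value stored

def make_template(opcode, dests, arity=2):
    return Template(opcode, dests, [EMPTY] * arity)

class StaticPE:
    def __init__(self, activity_store):
        self.store = activity_store          # address -> Template
        self.instruction_queue = deque()     # addresses of enabled instructions
        self.outputs = []

    def update(self, addr, slot, value):
        """Update unit: store a result and enqueue the instruction if it is enabled."""
        if addr is None:                     # assumed convention: leaves the program
            self.outputs.append(value)
            return
        t = self.store[addr]
        assert t.slots[slot] is EMPTY, "more than one token on an arc"
        t.slots[slot] = value
        if all(s is not EMPTY for s in t.slots):
            self.instruction_queue.append(addr)

    def fetch(self):
        """Fetch unit: form an operation packet and clear the operand slots."""
        t = self.store[self.instruction_queue.popleft()]
        packet = (t.opcode, list(t.slots), list(t.dests))
        t.slots = [EMPTY] * len(t.slots)
        return packet

    @staticmethod
    def operate(packet):
        """Operation unit: compute the result and one result packet per destination."""
        opcode, operands, dests = packet
        result = OPS[opcode](*operands)
        return [(addr, slot, result) for addr, slot in dests]

    def run(self):
        while self.instruction_queue:
            for result_packet in self.operate(self.fetch()):
                self.update(*result_packet)
        return self.outputs
```

As a usage example, loading the two templates {0: make_template("mul", [(1, 0)]), 1: make_template("add", [(None, 0)])}, storing 2 and 3 into the slots of instruction 0 and 4 into slot 1 of instruction 1, and calling run() evaluates 2*3 + 4.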




It  is  possible  to  connect  many  such  processors  together  via  a  packet  communication  network.  The 
activity store of each processor can be loaded with a part of a dataflow graph. Notice that large
delays in the communication network do not affect the performance, i.e., the number of operations
performed  per  second,  as  long  as  enough  enabled  nodes  are  present  in  each  processor.  This  is  an 
important  characteristic  of  dataflow  machines;  they  can  use  parallelism  in  programs  to  hide 
communication  latency  between  processors. 


2.1.1. Enforcing the One-Token-Per-Arc Restriction

The above description of the static machine skips over a very important and rather subtle point:
the one-token-per-arc restriction of Dennis' model. Suppose the units communicate with a full
send-acknowledge protocol, i.e., a token moves to the next unit only after that unit has signalled
that it can accept the token, and the update unit writes into an operand slot only if the slot is empty.
Even with these assumptions, multiple tokens belonging to the same arc may coexist in the
machine, since there may be buffering in the units and communication network. It is infeasible for
the update or fetch units to determine that there is no token in the system for a particular arc. If
multiple tokens can coexist on an arc then the FIFO assumption may be violated, because two
firings of a node may execute on different operation units within a PE and the one that is logically
second in the queue may finish first. The communication system will ultimately direct these result
tokens to the same destination node, but in the wrong order. To see how the dataflow model
malfunctions if tokens on an arc get out of order, consider the example in Figure 2.b with the plus
operator replaced by minus. The results of F(1) and F(2) can potentially reside on the left input to
the minus operator concurrently, but if F(2) is processed before F(1) the answer will be wrong.

If  the  one-token-per-arc  restriction  can  be  enforced,  then  the  problems  due  to  reordering  of 
tokens will not arise. The restriction cannot be enforced at the hardware level, but its effect can be
achieved  by  executing  only  graphs  which  have  the  property  that  no  more  than  one  token  can  reside 
on  any  arc  at  any  stage  of  execution.  It  is  possible  to  transform  any  dataflow  graph  into  a  dataflow 
graph  with  this  property.  In  the  simplest  transformation,  for  each  arc  in  the  graph,  an 
acknowledgment arc is added in the opposite direction. A token on an acknowledgment arc
indicates  that  the  corresponding  data  arc  is  empty.  Initially,  a  token  is  placed  on  each 
acknowledgment  arc.  A  node  is  enabled  to  fire  when  a  token  is  present  on  each  input  arc  and  each 
incoming  acknowledgment  arc.  At  the  hardware  level,  the  only  difference  between  the  two  kinds  of 
arcs is that the value of a token on an acknowledgment arc is ignored. Instead of the presence bits
for  operands,  a  counter  is  associated  with  each  instruction.  The  counter  is  initialized  to  the  number 
of  operands  plus  the  number  of  incoming  acknowledgment  arcs  and  decremented  by  the  update 
unit  whenever  an  operand  or  acknowledgment  arrives.  The  node  is  enabled  when  the  counter 
reaches  zero.  Notice  that  the  generation  of  acknowledgments  must  be  delayed  enough  after  the 
operation  packet  is  formed  so  that  there  is  no  way  for  results  of  the  second  firing  to  overtake  the 
first. 
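
The counter scheme can be illustrated with a minimal Python sketch. It assumes a producer A feeding a consumer B over a single data arc, with one acknowledgment arc from B back to A; the class Instruction, its methods, and the function demo are hypothetical names, and the external input arc to A is left unconstrained for brevity.

```python
class Instruction:
    def __init__(self, n_operands, n_ack_arcs):
        # Counter starts at the number of operands plus incoming acknowledgment arcs.
        self.reset_value = n_operands + n_ack_arcs
        self.counter = self.reset_value

    def arrival(self):
        """Update unit: an operand or acknowledgment arrived; True if now enabled."""
        self.counter -= 1
        return self.counter == 0

    def fire(self):
        """Firing consumes the inputs, so the counter is re-armed."""
        self.counter = self.reset_value

def demo():
    a = Instruction(n_operands=1, n_ack_arcs=1)   # producer, acknowledged by B
    b = Instruction(n_operands=1, n_ack_arcs=0)   # consumer
    log = []
    log.append(a.arrival())   # initial token on the acknowledgment arc: A->B is empty
    log.append(a.arrival())   # first operand arrives, A is enabled
    a.fire()                  # A places a token on the data arc A->B
    log.append(b.arrival())   # B receives it and is enabled
    log.append(a.arrival())   # second operand at A: A->B still occupied, not enabled
    b.fire()                  # B consumes the token ...
    log.append(a.arrival())   # ... and its acknowledgment enables A again
    return log
```

The fourth entry of the log shows the cost of the scheme: A cannot fire a second time until B has fired and its acknowledgment has travelled back.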

The one-token-per-arc restriction is not entirely satisfactory. Even though many of the
acknowledgment arcs in a program graph can be eliminated [40], the amount of token traffic
increases by a factor of 1.5 to 2, the time between successive firings of a node increases drastically,
and most importantly, the amount of parallelism that can be exploited in a program is reduced. In
particular, the dynamic unfolding of loops is severely constrained, as shown by the following
example. Suppose F in Figure 2.b is replaced by the acyclic graph in Figure 1 (perhaps we take the
inputs a, b, and c to be i). It should be possible to pipeline four distinct computations through this
graph, but, unfortunately, with the static approach the second initiation must wait until the divide
node fires, clearing the input arc for c. This problem has received substantial attention [20] and can
be partially overcome by introducing extra identity operators to balance the path lengths in a graph.
For example, if three identity nodes are added on the right input to the divide in Figure 1, the path
lengths would be perfectly balanced. The balancing approach assumes that execution times for all
operators are the same and communication delays between operators are constant. Neither
assumption is realistic and balancing becomes computationally intractable without these
assumptions.

We note in passing that modeling unbounded-FIFO dataflow graphs by fixed storage dataflow
graphs  (introduction  of  acknowledgment  arcs  is  one  example  of  such  modeling),  changes  the 
"meaning"  of  a  dataflow  graph  in  a  subtle  way.  A  graph  may  be  deadlock  free  in  the  unbounded 
case,  but  its  corresponding  graph  with  acknowledgment  arcs  may  deadlock  under  certain 
circumstances.  These  shortcomings,  in  addition  to  the  inability  to  handle  user-defined  functions, 
motivated  work  on  the  more  general  dynamic  dataflow  approach  discussed  next. 

2.2.  Dynamic  or  Tagged-Token  Dataflow 

Each  token  in  a  static  dataflow  machine  must  carry  the  address  of  the  instruction  for  which  it  is 
destined.  This  is  already  a  tag.  Suppose,  in  addition  to  specifying  the  destination  node,  the  tag  also 
specifies  a  particular  firing  of  the  node.  Then,  two  tokens  participate  in  the  same  firing  of  a  node  if 
and  only  if  their  tags  are  the  same.  Another  way  of  looking  at  tags  is  simply  as  a  means  of 
maintaining the logical FIFO order of each arc, regardless of the physical arrival order of tokens.
The token which is supposed to be the i-th value to flow along a given arc carries i in its tag. The
trick  is  to  give  simple  tag  generation  rules  for  the  control  operators,  switch  and  merge.  Arvind  and 
Gostelow  [7]  have  given  such  rules  for  Dennis’  operators  [22].  However,  if  only  well-behaved 
graphs  are  considered,  then  it  is  possible  to  develop  even  simpler  tag  manipulation  rules  [9].  We 
briefly  explain  these  latter  rules  as  well  as  the  effect  of  tagging  on  the  dataflow  model  presented  in 
Section  1. 

2.2.1.  Tagging  Rules 

We  intend  the  tagged-token  approach  to  support  user-defined  functions,  so  a  program  is  viewed 
as  a  collection  of  graphs,  called  code-blocks,  where  each  graph  is  either  acyclic  or  a  single  loop.  A 
node is identified by a pair <code-block, instruction address>. Tags have four parts: <invocation ID,
iteration ID, code-block, instruction address>, where the latter two identify the destination
instruction and the former two identify a particular firing of that instruction. The iteration ID
distinguishes between different iterations of a particular invocation of a loop code-block, while the
invocation ID distinguishes between different invocations. All the tokens for one firing of an
instruction must have identical tags, and enabled instructions are detected by finding sets of tokens
with identical tags. Tokens also carry a port number which specifies the input arc of the destination
node on which the token resides; this is not part of the tag, and thus does not participate in
matching. 
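
A minimal sketch of tag matching, assuming instructions with exactly two operands, is given below. The names Tag, Token, and MatchingStore and the dictionary-based store are illustrative assumptions; they show only that tokens pair up by tag regardless of arrival order, with the port number used afterwards to order the operands.

```python
from typing import Any, NamedTuple

class Tag(NamedTuple):
    invocation: int   # which invocation of the code-block
    iteration: int    # which iteration of the loop
    code_block: str   # destination code-block
    address: int      # destination instruction within the code-block

class Token(NamedTuple):
    tag: Tag
    port: int         # input arc of the destination node; not part of the tag
    value: Any

class MatchingStore:
    """Holds tokens waiting for a partner with an identical tag (dyadic case only)."""
    def __init__(self):
        self.waiting = {}

    def arrive(self, token):
        partner = self.waiting.pop(token.tag, None)
        if partner is None:
            self.waiting[token.tag] = token      # wait for the partner
            return None
        # Both operands present: the firing is enabled; order operands by port.
        pair = sorted((partner, token), key=lambda t: t.port)
        return token.tag, tuple(t.value for t in pair)
```

Tokens for iteration 2 may arrive before those for iteration 1 without any confusion, since only tokens with identical tags are paired.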

Consider  first  the  execution  of  an  acyclic  graph  such  as  in  Figure  1.  A  set  of  tokens  whose  tags 
differ  only  in  the  instruction  address  part  is  placed  on  the  input  arcs.  When  an  instruction  fires,  it 
generates tags for each result token by using the destination address in the instruction as the
instruction  address  part  and  copying  the  rest  from  the  input  tag.  For  conditionals  the  scenario  is 
similar,  but  there  are  two  destination  lists.  A  single  wave  of  inputs  is  steered  through  one  arm  or 
the other. We will ensure, however, that no two waves of inputs carry the same invocation and
iteration IDs in their tags. Thus, for any given tag, a data item carrying that tag will arrive on at
most one side of the merge. Since the order of tokens on the arcs is immaterial, there is no need to
orchestrate the merge via the output of the predicate as in the FIFO model; the streams of tokens
produced by the two arms can be merged in an arbitrary fashion. This modified conditional schema
is shown in Figure 4.a. The ® is not an operator; it merely denotes that two arcs converge on the
same port.


Figure  4:  Conditional  and  Loop  Graphs  for  Tagged  Approach 

The loop requires a control operator, named D, to increment the iteration ID portion of the tag
(see Figure 4.b). The iteration ID of each initial input to the loop is zero. As in the conditional
schema, the merges can be eliminated from the loop schema because the tags on the tokens on the
True and False sides of a merge will be disjoint. The D⁻¹ operator is used to reset the iteration ID to
zero. To implement nested loops and user-defined functions, an additional operator is required to
assign unique invocation IDs. The apply operator takes a code-block name and an argument as
input, and forwards the argument to the designated code-block after assigning it a new invocation
ID and setting its iteration ID to zero. The tag for the output arc of the apply node is also sent to the
invoked graph so the result can be returned to the destination of the apply node, as if it were


generated  by  the  apply  node  itself.  One  may  visualize  the  action  of  an  apply  as  coloring  input 
tokens  in  such  a  manner  tliat  they  do  not  mix  with  tokens  belonging  to  other  invocations  of  the 
same  code-blcx:k.  Of  course,  there  must  be  a  complementary  operator  to  restore  the  original  color 
for  the  result  tokens.  The  interested  reader  is  referred  to  [10]  for  more  detail. 
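
The tag manipulations described above can be illustrated with a minimal sketch. It assumes a tag is
a triple of invocation ID, iteration ID, and instruction address; the names Tag, result_tag, d_op,
d_inverse, and apply_op are hypothetical and chosen only for this illustration, and the global
counter stands in for whatever mechanism actually hands out invocation IDs.

```python
from dataclasses import dataclass, replace
from itertools import count


@dataclass(frozen=True)
class Tag:
    invocation: int  # identifies one invocation of a code-block
    iteration: int   # identifies one iteration within that invocation
    address: int     # instruction address within the code-block


def result_tag(input_tag, destination):
    """Ordinary firing: swap in the destination address, copy the rest."""
    return replace(input_tag, address=destination)


def d_op(input_tag, destination):
    """D operator: increment the iteration ID for the next loop iteration."""
    return replace(input_tag, iteration=input_tag.iteration + 1, address=destination)


def d_inverse(input_tag, destination):
    """Inverse of D: reset the iteration ID to zero on loop exit."""
    return replace(input_tag, iteration=0, address=destination)


_fresh_invocation = count(1)  # assumption: a simple counter supplies unique IDs


def apply_op(caller_tag, entry_address, return_address):
    """apply: give the argument a fresh invocation ID and iteration ID zero.

    Also returns the tag of the apply node's output arc, which the invoked
    graph needs in order to send the result back to the caller.
    """
    callee_tag = Tag(next(_fresh_invocation), 0, entry_address)
    return_tag = replace(caller_tag, address=return_address)
    return callee_tag, return_tag
```

Because two calls to apply_op never produce the same invocation ID, tokens of different invocations
of the same code-block cannot match each other.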

The  tagged-token  approach  eliminates  the  need  to  maintain  FIFO  queues  on  the  arcs  (though 
unbounded  storage  is  still  assumed),  and  consequently  offers  more  parallelism  than  the  abstract 
model  presented  in  Section  1.  In  fact,  it  has  been  shown  that  no  interpreter  can  offer  more 
parallelism  than  the  tagged-token  approach  [8]. 

2.2.2.  Tagged-Token  Dataflow  Machine 

A  machine  proposed  by  Arvind  et  al.  [4]  is  depicted  in  Figure  5.  It  comprises  a  collection  of 
processing  elements  (PE's)  connected  via  a  packet  communications  network.  Each  PE  is  a  complete 
dataflow  computer.  The  waiting-matching  store  is  a  key  component  of  this  architecture.  When  a 
token  enters  the  waiting-matching  stage,  its  tag  is  compared  against  the  tags  of  the  tokens  resident 
in  the  store.  If  a  match  is  found,  the  matched  token  is  purged  from  the  store  and  is  forwarded  to  the 
instruction  fetch  stage,  along  with  the  entering  token.  Otherwise,  the  incoming  token  is  added  to 
the  matching  store  to  await  its  partner.  (Instructions  are  restricted  to  at  most  two  operands  so  a 
single match enables an activity.) Tokens which require no partner, i.e., are destined for a monadic
operator,  bypass  the  waiting-matching  stage. 
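
The matching rule just described can be sketched as follows. This is only an illustration under
simplifying assumptions: the store is an unbounded Python dictionary keyed by tag, each token
carries a port number (0 for left, 1 for right), and the class and method names are hypothetical.

```python
class WaitingMatchingStore:
    """Holds tokens of dyadic instructions until their partner arrives."""

    def __init__(self):
        self.waiting = {}  # tag -> (port, value) of the token awaiting a partner

    def enter(self, tag, port, value, monadic=False):
        """Return (tag, operands) if an activity is enabled, otherwise None."""
        if monadic:
            # Tokens destined for a monadic operator bypass the store.
            return tag, (value,)
        if tag in self.waiting:
            # Match found: purge the resident token and forward both.
            other_port, other_value = self.waiting.pop(tag)
            operands = [None, None]
            operands[port] = value
            operands[other_port] = other_value
            return tag, tuple(operands)
        # No match: the incoming token waits for its partner.
        self.waiting[tag] = (port, value)
        return None
```

A real waiting-matching section has bounded capacity, which is the source of the deadlock concern
discussed in the text; the dictionary here ignores that limit.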

Once  an  activity  is  enabled,  it  is  processed  in  a  pipelined  fashion  without  further  delay.  The 
invocation ID in the tag designates a triple of registers (CBR, DBR, and MAP) which contain all the
information associated with the invocation. CBR contains the base address of the code-block in
program memory, DBR contains the base address of a data area which holds values of loop
variables that behave as constants, and MAP contains mapping information describing how
activities of the invocation are to be distributed over a collection of PE's. The instruction fetch stage
is thus able to locate the instruction and any required constants. The op code and data values are
passed to the ALU for processing. In parallel with the ALU, the compute tag stage accesses the
destination list of the instruction and prepares result tags using the mapping information. Result
values  and  tags  are  merged  into  tokens  and  passed  to  the  network,  whereupon  they  are  routed  to  the 
appropriate  waiting-matching  store. 

It  is  important  to  realize  that  if  the  waiting-matching  store  ever  gets  full  the  machine  will 
immediately  deadlock;  tokens  can  leave  the  waiting-matching  section  only  by  matching  up  with 
incoming  tokens.  A  similar  argument  can  be  made  to  show  that  if  the  total  storage  between  the 
output  of  the  waiting-matching  section  and  the  paths  leading  to  its  input  is  bounded,  a  deadlock  can 
occur [17]. Therefore, in addition to the functional units described in Figure 5, each PE must have a
token buffer. This buffer can be placed at a variety of points, including at the output stage or the
input stage, depending on the relative speeds of the various stages. Both the waiting-matching store
and  the  token  buffer  have  to  be  large  enough  to  make  the  probability  of  overflow  acceptably  small. 

The apply operator is implemented as a small graph. The invocation request is passed to a
system-wide resource manager so that resources such as a new invocation ID, program memory, etc.
can be allocated for the new invocation. A code-block invocation can be placed on essentially any
collection of processors. Various instances, i.e., firings, of instructions are assigned to PE's within a
collection by "hashing" the tags. A variety of mapping schemes have been developed to distribute
the most frequently encountered program structures efficiently. The MAP register assigned to a
code-block invocation keeps the hashing function to be used for mapping activities of the
code-block.

Figure 5: Processing Element of the MIT Tagged Token Dataflow Machine

Efficient handling of "loop constants" is a fairly low-level optimization, but important enough to
deserve mention. In the abstract model, variables which are invariant over all iterations for a
particular invocation of a loop, but vary for different invocations, must be circulated. Variable N in
Figure 2.b is an example of such a variable. Values of such variables cannot be placed in the
instructions without making the graph non-reentrant. To avoid this overhead, most dataflow
machines  provide  a  mechanism  for  efficient  handling  of  loop  constants.  As  an  example  of  the 
importance  of  this  optimization,  note  that  the  inner  loop  of  a  straightforward  matrix  multiply 
program  has  seven  loop  variables,  five  of  which  are  loop  constants.  In  the  MIT  tagged-token 
machine,  storage  for  such  constants  is  allocated  in  program  memory  when  a  loop  code-block  is 
invoked;  DBR  points  to  this  area,  allowing  these  constants  to  be  fetched  along  with  the  instruction. 
The  constant  area  is  deallocated  when  the  invocation  terminates.  If  the  loop  invocation  is  spread 
over  multiple  PE’s,  setting  up  constant  areas  is  a  little  tricky,  since  an  image  must  be  made  in  each 
PE  before  the  first  iteration  is  allowed  to  begin. 

The  tagged-token  architecture  circumvents  the  shortcomings  identified  in  the  static  architecture, 
but  it  also  presents  some  difficult  issues.  In  the  static  machine,  the  storage  has  to  be  allocated  for  all 
arcs  of  a  program  graph,  though  tokens  may  coexist  only  on  a  small  fraction  of  them.  In  contrast, 
token  storage  is  used  more  efficiently  in  the  tagged-token  approach,  because  storage  requirement  is 
determined by the number of tokens that can coexist. However, programs exhibit much more
parallelism  under  the  tagged-token  approach  (actually  even  more  so  than  the  unbounded-FIFO 
model),  and  consequently,  can  drive  the  token  storage  requirement  so  high  that  the  machine  may 
deadlock  [17].  This  has  turned  out  to  be  a  serious  enough  problem  in  practice  that  we  now  generate 
only  those  graphs  in  which  the  parallelism  is  bounded.  In  the  dynamic  machine,  the  mechanism  for 
detecting  enabled  activities  appears  more  complex,  since  matching  is  required  as  opposed  to 
decrementing  a  counter.  Further,  tokens  carry  more  tagging  information  though  no 
acknowledgment  tokens  are  needed.  If  tags  are  to  be  kept  relatively  small,  there  must  be  facilities 
for  reusing  tags.  This,  in  turn,  requires  detecting  the  completion  of  code-block  invocations,  an 
action  which  generally  involves  a  nontrivial  amount  of  computation.  This  task  would  be  virtually 
impossible  if  the  graphs  were  not  "self-cleaning",  which  is  a  consequence  of  graphs  being  well- 
behaved.  Finally,  an  efficient  mechanism  is  required  for  allocating  resources  to  new  code-block 
invocations. 

2.3.  Tags  as  Memory  Addresses  and  vice  versa 

The performance of a tagged-token machine is crucially dependent upon the rate at which the
waiting-matching section can process tokens. Though the size of the waiting-matching store
depends upon many factors, based on our preliminary studies we expect that it will be in the range
of 10K to 100K tokens. In this size range, a completely associative memory is ruled out, but a hash
table, possibly augmented with a small associative memory, is viable, and the waiting-matching
sections of the machines discussed in Section 4 are organized as such. Hashing basically involves
calculating the address of a slot in the hash table by applying some "hash" function to the tag of the
token (see [33] for examples of the hashing functions used in a tagged-token machine).

Gino Maa, a member of our group, has suggested that tags should be viewed as addresses for a
virtual memory in which the primitive operation is store-extract. Given a datum and an address, the
store-extract  operation  stores  the  data  in  the  slot  specified  by  the  address  if  the  slot  is  empty, 
otherwise  the  contents  of  the  slot  are  read  and  the  slot  is  considered  empty.  A  page  of  virtual 
memory  may  contain,  for  example,  tokens  with  identical  contexts.  It  is  clear  that  only  a  tiny 
fraction  of  the  virtual  address  space  will  be  occupied  at  any  given  time  and  physical  storage  is 
required  only  for  this  fraction.  Thus,  the  problem  of  the  design  of  the  waiting-matching  section 
becomes the problem of implementing a very large virtual memory (40-bit addresses or larger),
where a nonexistent page is allocated automatically upon an attempt to access it and deallocated
when  all  its  entries  are  empty.  Caches  may  be  effective  in  organizing  such  a  memory  as  there  is 
evidence  to  suggest  that  when  an  incoming  token  finds  its  partner,  the  partner  is  usually  among  the 
most  recently  arrived  tokens  [15].  The  difference  between  the  implementation  of  a  large  virtual 
address space and the hashing approach discussed earlier may be minimal; however, viewing tags as
addresses  allows  us  to  place  many  variations  of  static  and  dynamic  machines  on  a  continuum,  in 
which  the  address  on  a  token  in  the  static  machine  becomes  the  tag  on  a  token  in  the  dynamic 
machine. 
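
The store-extract primitive can be sketched in a few lines. The sketch below is an illustration, not
a description of any proposed hardware: it assumes a sparse dictionary as the backing store, so that
only occupied slots consume physical storage, and the names StoreExtractMemory and EMPTY are
hypothetical. Paging and caching are omitted.

```python
EMPTY = object()  # sentinel returned when the datum was stored rather than extracted


class StoreExtractMemory:
    """Sparse virtual memory whose only primitive is store-extract."""

    def __init__(self):
        self.slots = {}  # address (tag) -> datum; absent key means empty slot

    def store_extract(self, address, datum):
        """Store `datum` if the slot is empty; otherwise read and empty the slot."""
        if address in self.slots:
            # Slot occupied: return its contents and consider it empty again.
            return self.slots.pop(address)
        # Slot empty: the incoming datum waits here for its partner.
        self.slots[address] = datum
        return EMPTY
```

Used as a waiting-matching section, the first token of a pair is stored and the second token extracts
it, after which the slot no longer occupies any storage.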

Consider  extending  the  static  machine  by  operators  to  allocate  activity  store  dynamically,  thus 
allowing procedure calls to be implemented. In all such implementations, a part of the address
serves the purpose of the "context" part of the tag in the dynamic machine, and the task of
allocating  a  new  context  is  subsumed  by  the  task  of  allocating  activity  storage.  A  common 
optimization  in  such  schemes  is  to  separate  the  operand  slots  of  an  instruction  from  the  rest,  and  to 
allocate  a  new  template  containing  operand  slots  for  a  code-block  at  the  time  of  invocation.  To 
achieve  sharing  of  a  code-block  among  several  invocations  requires  relocation  registers  like  CBR, 
DBR,  etc.  of  the  MIT  tagged-token  machine.  Another  variation  discussed  in  the  literature 
eliminates the need for acknowledgment arcs by allowing only acyclic graphs [26, 44]. Since a loop
can  be  modeled  as  a  recursive  procedure,  this  offers  a  trade-off  between  the  cost  of  extra  procedure 
calls  and  the  savings  due  to  the  elimination  of  acknowledgments.  As  discussed  earlier,  there  are 
subtle issues associated with the implementation of the apply operator, e.g., the time of storage
allocation affects the amount of parallelism that can be exploited by the machine.

Coming  from  the  other  direction,  a  variation  of  the  tagged-token  machine  that  has  been  proposed 
by  David  Culler  and  Greg  Papadopoulos  (also  of  our  group)  is  to  replace  the  waiting-matching 
section  of  the  tagged-token  machine  by  a  token  storage  that  is  explicitly  allocated  at  the  time  of 
procedure  invocation.  It  is  possible  to  do  so  if  the  storage  requirement  of  a  code-block  can  be 
determined  prior  to  invoking  it.  The  type  of  bounded-loop  graphs  that  we  propose  to  run  on  the 
machine  have  this  property. 

After  examining  some  of  the  variations  discussed  here,  the  distinction  between  the  static  and 
dynamic  dataflow  becomes  somewhat  fuzzy.  Choosing  a  good  design  among  the  ones  proposed  (or 
one yet to be proposed) is an active research topic in this field. The only general statement we can
make is that giving the programmer or the compiler greater control over the management of
resources increases his responsibility and burden, but may provide significant performance
improvements and may simplify the design of the machine.

3. Data Structures

Section  1  described  how  data  structures  can  be  incorporated  in  the  dataflow  model  without 
sacrificing  its  elegance  or  utility  for  parallel  computation.  We  now  illustrate  the  difficulties  in 
implementing  "functional"  data  structures  efficiently  and  describe  an  alternative  view  known  as 
I-structures. This latter approach offers an efficient implementation without sacrificing
determinacy, and allows more parallelism to be exploited in programs than the "functional"
approach. 

3.1.  Functional  Operations  On  Data  Structures 

The  simplest  form  of  "functional"  data  structures  is  reflected  in  the  operations  cons,  first,  and  rest. 
Cons glues two values together to form a pair; first and rest select values from such pairs. Clearly,
we cannot allow arbitrarily large values to be carried on a token, so pairs must be maintained in
storage with tokens carrying the addresses of these pairs. To this end, dataflow machines provide
structure  storage,  which  should  be  considered  as  a  special  operation  unit  with  internal  storage.  The 
unit  is  shared  by  all  PE's  and  is  capable  of  performing  many  concurrent  structure  operations. 

To  see  how  the  structure  store  and  its  associated  operations  behave,  we  can  step  through  the 
execution  of  a  first  operation.  A  first  operation  is  enabled  by  the  arrival  of  a  token  carrying  a 
pointer.  Neither  the  fetch  unit  in  the  static  machine  nor  the  ALU  in  the  tagged-token  machine  can 
access  the  structure  storage  directly.^  Thus,  a  new  packet  containing  the  read  request  and  the 
address  or  tag  of  the  destination  node  of  the  first  operation  is  sent  to  the  structure  storage.  Upon 
receipt  of  such  a  request,  the  structure  storage  controller  produces  a  token  containing  the  left  value 
of  the  pair  and  sends  it  to  the  appropriate  destination  instruction;  this  is  depicted  in  Figure  6. 

Similarly,  for  the  cons  operator,  two  input  data  values  together  with  the  destination  node  address 
(or  tag)  are  sent  to  a  structure  storage  unit.  The  structure  controller  allocates  storage  for  the  pair, 
writes  the  elements,  and  sends  a  pointer  for  the  newly  allocated  storage  to  the  destination 
instruction. 

Figure 6: Action of a first operation

^ Not providing direct access to a large storage shared by many PE's is certainly a design choice, but a fundamental one.
In a machine with many processors and many structure controllers, the time to access a particular memory controller may
be very large. If the instruction processing pipeline blocks for structure operations, the performance of the machine will
be greatly affected by the latency of the communication system. One beauty of dataflow machines is they can be made
extremely tolerant of latency, and thus can sustain high performance with many processors working on a single problem.
Detailed arguments to this account can be found in Arvind and Iannucci [11].

The implementation of large, flat data structures, such as arrays, presents difficult design
trade-offs. If arrays are implemented as linked lists using cons, selection operations are inefficient. If,
instead, array elements are stored contiguously, as a generalization of the pairing operation, the
append operation becomes costly. This is because append involves creating a new array and copying
all except one element from the old array. Efficient implementations of arrays have been
researched extensively [1, 31] and two key ideas have emerged to reduce copying. First, if the array
descriptor (or pointer) fed to the append operator is the only descriptor in existence for the
corresponding array, the array can be updated in place without risk of causing a read-write race.
Second, if the array is represented as a tree, then only the nodes along the path to the appended
element need be regenerated; the rest of the tree can be shared. This reduces the amount of
allocation and copying, but increases the time for selection.
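
The second idea, regenerating only the path to the appended element, can be sketched as follows.
The sketch assumes a balanced binary tree built from nested Python tuples over an array whose
length is a power of two; these representation choices and the function names build, select, and
append are assumptions made for the illustration and are not taken from the cited work.

```python
def build(values):
    """Build a balanced binary tree over a list whose length is a power of two."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return (build(values[:mid]), build(values[mid:]))


def select(tree, index, size):
    """Return element `index` of a tree holding `size` elements."""
    while size > 1:
        size //= 2
        if index < size:
            tree = tree[0]
        else:
            tree = tree[1]
            index -= size
    return tree


def append(tree, index, value, size):
    """Return a new tree that differs from `tree` only at `index`.

    Only the nodes on the path from the root to the element are rebuilt;
    every other subtree is shared with the old version.
    """
    if size == 1:
        return value
    half = size // 2
    left, right = tree
    if index < half:
        return (append(left, index, value, half), right)
    return (left, append(right, index - half, value, half))
```

For an array of n elements, append allocates about log2(n) new nodes instead of copying n
elements, while select now takes about log2(n) steps instead of one, which is the trade-off noted
above.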

3.2. I-Structures

The  "functional"  view  of  structures  imposes  unnecessary  restrictions  on  program  execution, 
regardless of how efficiently it is implemented. Consider the simple example cons(f(a),g(a)); the
cons will not be enabled until both f(a) and g(a) have completed. Thus, another part of the program
which  uses  the  first  element  of  the  pair,  but  not  the  second,  must  wait  until  both  elements  have 
been  computed.  Such  data  structures  are  called  strict  in  programming  language  jargon.  In  contrast, 
cons  can  be  treated  as  a  non-strict  operator  [27],  allowing  an  element  of  a  pair  to  be  used  regardless 
of  whether  the  other  element  has  been  produced.  The  resultant  increase  in  parallelism  is  far  greater 
than  one  might  naively  imagine. 

The  firing  rule  for  non-strict  cons  is  difficult  to  implement.  One  way  to  circumvent  this  difficulty 
is to treat cons as a triplet of operations, as shown in Figure 7. The implicit storage allocation of
strict  cons  becomes  visible  as  a  new  type  of  node  in  the  dataflow  graph.  The  descriptor  produced  by 
the  allocate  operator  is  passed  to  the  two  store  operations,  in  addition  to  the  subsequent  select 
operations.  This  allows  consumption  of  a  structure  to  proceed  in  parallel  with  production,  but  also 
raises  an  awkward  problem:  a  first  or  rest  operation  may  be  executed  before  the  corresponding 
store. This seemingly catastrophic situation can be resolved with the help of a smart structure-
storage controller. If a read request arrives for a storage cell which has not been written, the
controller defers the read until a write arrives. This is the basic idea behind I-structure storage.

Figure 7: Implementation of Non-strict Cons

Referring to Figure 8, each storage cell contains status bits to indicate that the cell is in one of
three possible states. (1) PRESENT: The word contains valid data which can be freely read as in a
conventional memory. Any attempt to write it will be signalled as an error. (2) ABSENT: Nothing
has been written into the cell since it was last allocated. No attempt has been made to read the cell;
it may be written as for conventional memory. (3) WAITING: Nothing has been written into the
cell, but at least one attempt has been made to read it. When it is written, all deferred reads must be
satisfied. Cells change state in the obvious ways when presented with requests. Destination tags of
deferred read requests are stored in a part of the I-structure storage specially reserved for that
purpose.
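
The three cell states and their transitions can be sketched as a small state machine. This is an
illustrative model only: the class name IStructureStore, the per-cell Python lists used for deferred
reads, and the send callback that delivers a value to a destination tag are all assumptions made for
the sketch.

```python
class IStructureStore:
    """Write-once cells with deferred reads (PRESENT / ABSENT / WAITING)."""

    def __init__(self, size, send):
        self.state = ["ABSENT"] * size
        self.value = [None] * size
        self.deferred = [[] for _ in range(size)]  # destination tags of deferred reads
        self.send = send  # callback: send(destination_tag, value)

    def read(self, index, destination_tag):
        if self.state[index] == "PRESENT":
            # Valid data: answer immediately, as in a conventional memory.
            self.send(destination_tag, self.value[index])
        else:
            # Not yet written: defer the read until the write arrives.
            self.deferred[index].append(destination_tag)
            self.state[index] = "WAITING"

    def write(self, index, value):
        if self.state[index] == "PRESENT":
            raise RuntimeError("attempt to write a cell that is already PRESENT")
        self.value[index] = value
        self.state[index] = "PRESENT"
        # Satisfy every read that was deferred on this cell.
        for destination_tag in self.deferred[index]:
            self.send(destination_tag, value)
        self.deferred[index].clear()
```

Every read of a given cell receives the same value regardless of whether it arrives before or after
the write, which is what keeps the model determinate.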


While I-structure storage can be used to implement non-strict cons, to exploit the full potential of
this form of storage, functional languages can be augmented with explicit allocate and store
operations. From a programmer's perspective, an I-structure is an array of slots [42] which are
initially empty, and which can be written at most once. Regardless of when or how many times a
select instruction for a particular slot is executed, the value returned is always the same. This
preserves the determinacy property of the model. I-structures are not "functional" data structures;
they are "monotonic objects" which are constructed incrementally, hence their name.


I-structures provide the kind of synchronization needed for exploiting producer-consumer
parallelism without risk of read-write races. I-structure read requests for which the data is present
require about the same time as conventional reads, and with special hardware [32] deferred reads
can be processed quickly. Thus, as long as most read requests follow the corresponding write, the
overhead of I-structure memory is small, and the utility is enormous.


The benefit of non-strict structures in terms of the amount of parallelism exhibited by programs is
surprisingly large. For example, methods in which a large mesh is repeatedly transformed into a
new version by performing some calculation for each point are common in numerical computing.
Some such methods show tremendous parallelism because all mesh points can be computed
simultaneously. However, even when this is not possible because of data dependencies, it is usually
possible to overlap the computation of several versions of the mesh. This latter form of parallelism
can be exploited only if the mesh is represented as a non-strict structure.

Figure 8: I-Structure Storage


4.  Current  Dataflow  Projects 

We  now  present  an  overview  of  some  of  the  more  important  dataflow  projects,  restricting  our 
attention  to  those  that  have  built  or  are  currently  building  a  dataflow  machine.  In  particular,  we  do 
not  address  how  dataflow  concepts  have  influenced  high-performance  von  Neumann  computers 
being  designed  today. 

4.1. Static Machine Projects

It  is  no  exaggeration  to  say  that  all  dataflow  projects  started  in  the  seventies  were  directly  based 
on Dennis' seminal work [22]. Such projects, besides Dennis' own project, include the LAU project
in Toulouse, France [16], the Texas Instruments dataflow project [35], the Hughes dataflow machine
[28], and several projects in Japan [48, 41]. Even the work on tagged-token machines at the
University of Manchester in England and the University of California at Irvine was inspired by
Dennis' work.

4.1.1. The MIT Static Machine Dataflow Project

Dennis' group at MIT has proposed and refined several static dataflow architectures over the
years [21, 46, 19, 25], and has implemented an eight-processor engineering model of the static
machine shown in Figure 3 [19]. The processing elements (PE's) were built out of AMD bit-slice
microprocessors and were connected by a packet-switched butterfly network composed of 2x2,
byte-serial routers with send-acknowledge protocol. The structure controller was not implemented.
Dataflow graphs for the machine were compiled from the language VAL [2]. A PDP-11 served as a
front end. While the machine operated successfully, it was only large enough to run toy programs.
Also, because of microcoding, the PE's were far slower than the routers. The Texas Instruments
machine [35], which was architecturally similar to Dennis' machine, was built by modifying four
conventional  processors.  Even  though  these  machines  proved  to  be  too  slow  to  generate 
commercial  interest  in  dataflow  machines,  they  have  had  marked  influence  on  instruction 
scheduling  in  high-performance  machines  intended  for  scientific  computing. 

4.1.2.  The  NEC  Dataflow  Machines 

The latest machines which may be classified as static machines are NEC's NEDIPS [48] and
Image Pipelined Processor (IPP) μPD7281 [41]. NEDIPS is a 32-bit machine intended for scientific
computation  and  uses  high-speed  logic,  while  the  IPP  is  a  single  chip  processor  of  similar 
architecture,  intended  as  a  building  block  for  highly  parallel  image  processing  systems.  We  focus 
on  the  latter  machine.  Generally,  image  processing  involves  applying  a  succession  of  filters  to  a 
stream  of  image  data.  Thus,  each  IPP  chip  may  be  loaded  with  a  dataflow  program  for  a  specific 
filter  or  several  filters. 

The  NEC  designers  have  generalized  the  machine  described  in  Section  2.1  by  allowing  multiple 
tokens per arc. To see how this is done, consider once again the static machine in Figure 3.
Instruction  templates  must  be  enlarged  to  include  a  collection  of  operand  slots.  If  we  assume  that 
the  operands  of  an  enabled  instruction  are  immediately  removed  from  the  activity  store  and 
forwarded  to  the  operation  units,  then  tokens  cannot  accrue  in  the  slots  for  both  the  left  and  right 
arcs  simultaneously.  Thus,  both  arcs  can  share  the  same  slots  as  long  as  a  flag  is  provided  in  the 
instruction  template  to  indicate  on  which  arc  (left  or  right)  the  current  tokens  reside.  Further,  the 
collection  of  slots  in  an  instruction  are  managed  as  a  cyclic  buffer,  with  two  pointers  marking  the 
head and tail of the queue. When an incoming token is for the same arc as the arc to which the
previously arrived tokens in the instruction belong, the update unit adds the data value of the
incoming token to the tail of the queue. Otherwise, the data value at the head is removed and
placed in the instruction queue, along with the incoming token. Notice it is not necessary for all
instruction templates to contain the same number of operand slots.
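
The shared operand queue can be sketched as follows. This is an illustrative model under stated
assumptions: a Python deque stands in for the cyclic buffer with head and tail pointers, arcs are
named "left" and "right", and the class name InstructionTemplate, the capacity parameter, and the
overflow behavior are hypothetical details not taken from the NEC design.

```python
from collections import deque


class InstructionTemplate:
    """Dyadic instruction whose two input arcs share one operand queue."""

    def __init__(self, opcode, capacity):
        self.opcode = opcode
        self.capacity = capacity  # number of operand slots in this template
        self.queue = deque()      # stands in for the cyclic buffer
        self.arc = None           # flag: arc on which the queued tokens reside

    def update(self, arc, value):
        """Accept a token; return an enabled (opcode, left, right) or None."""
        if not self.queue or arc == self.arc:
            # Same arc as the waiting tokens (or nothing waiting): add to the tail.
            if len(self.queue) == self.capacity:
                raise OverflowError("operand slots exhausted")
            self.arc = arc
            self.queue.append(value)
            return None
        # Opposite arc: pair the incoming token with the value at the head.
        waiting = self.queue.popleft()
        if self.arc == "left":
            return (self.opcode, waiting, value)
        return (self.opcode, value, waiting)
```

Since an enabled instruction leaves the template immediately, tokens never accumulate on both
arcs at once, which is why a single flag suffices. The OverflowError case corresponds to the buffer
overflow that the programmer must avoid by tuning the graph.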

In  the  IPP  implementation,  the  three  components  of  the  instruction  template,  op-code,  operand 
slots,  and  destination  list  are  placed  in  three  separate  memories  so  they  can  be  accessed  at 
consecutive stages of the instruction pipeline. Each IPP provides storage for 64 instructions, 128
arcs, and 512 16-bit data elements, which can be partitioned into queues of up to 16 slots per
instruction. The IPP also allows regions of the data memory to be used for constants and tables. In
addition, special hardware operations are provided for generating, coalescing, splitting, and merging
streams of tokens. A novel technique is employed to govern the level of activity in the instruction
pipeline: instructions with multiple destinations are queued separately from those with single
destinations, so when the pipeline is starved the multiple-destination instruction queue is given
priority, and when the instruction pipeline is full the other queue is favored. Buffered input/output
ports which support a full send-acknowledge protocol are provided, allowing up to 14 IPP's to be
connected in a ring. The system relies on a host processor to provide input/output, bookkeeping,
and operating system support.
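
The two-queue throttling technique can be sketched as a simple selection rule. The sketch is only
an illustration: the text does not say how the IPP measures pipeline occupancy, so the pipeline_load
argument and the single low_water threshold separating "starved" from "full" are assumptions, as is
the function name next_instruction.

```python
from collections import deque


def next_instruction(multi_dest, single_dest, pipeline_load, low_water):
    """Pick the next enabled instruction to issue, or None if both queues are empty.

    multi_dest  -- queue of instructions with multiple destinations
                   (each firing produces several tokens, raising activity)
    single_dest -- queue of instructions with a single destination
    """
    if pipeline_load < low_water:
        # Pipeline starved: favor instructions that generate more tokens.
        preferred, other = multi_dest, single_dest
    else:
        # Pipeline busy: favor instructions that generate fewer tokens.
        preferred, other = single_dest, multi_dest
    if preferred:
        return preferred.popleft()
    if other:
        return other.popleft()
    return None
```

The effect is a feedback loop: token-multiplying instructions are issued when there is little work in
flight and held back when the pipeline is already saturated.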

IPP  does  not  handle  acknowledgments  specially  and  requires  that  operand  storage  is  allocated 
statically, i.e., by the programmer or compiler. The programmer must tune the program graph to
avoid buffer overflows and ensure that tokens do not get out of order. This makes program
development for this machine a tedious task. The buffer overflow problem is much less severe in
NEDIPS because it provides much more data memory (64K words) than IPP. Still, the problem is
serious enough to cause the designers to modify NEDIPS so operand buffers can be extended or
shrunk dynamically in 128-word increments. As discussed in Section 2.3, this extension also makes
it  difficult  to  classify  NEDIPS  as  a  static  machine. 

NEDIPS and IPP are the first commercially available dataflow processors, and regardless of their
commercial  success,  which  only  time  will  tell,  they  are  major  milestones  in  non-von  Neumann 
architectures. 


4.2.  Tagged-Token  Machine  Projects 

The tagged-token dataflow approach was conceived independently by two research groups, one at
Manchester University in Manchester, England and one at the University of California at Irvine.
The tagged-token architecture presented in Section 2.2 is based on work by the latter group, which
has since moved to the Massachusetts Institute of Technology. The prototype tagged-token
machine completed at the University of Manchester in 1981 [29] is an important milestone, and
presents some interesting variations on the machine described above. A number of other prototype
efforts are in progress in Japan, most notably in Amamiya's group at NTT [3, 47], and Sigma-1 at
ETL, which is discussed later in this section.


4.2.1.  The  Manchester  Dataflow  Project 

The Manchester machine is essentially like the instruction processing section shown in Figure 5.
It is a single ring consisting of a token queue, a matching unit, an instruction store, and a bank of
ALU's. The ALU's are microcoded and fairly slow. It has demonstrated reasonable performance
(1.2 MIPS) with this arrangement, although the choice of many slow ALU's has received some
criticism because all the ALU's can be easily replaced by a single fast ALU. Tokens are 96 bits
wide, including: 37 bits for data, 36 for tag, and 22 for destination address. The matching unit is a
two-level store. The first level has a capacity of 1M tokens and uses a parallel hashing scheme to
map an incoming tag into a set of eight slots. The contents of the selected slots are associatively
matched against the incoming tag. The second-level overflow store uses hashing with linked lists.
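
The two-level matching scheme can be illustrated with a small software model. What follows is
only a sketch under stated assumptions, not a description of the Manchester hardware: the class
name MatchingStore, the number of sets, and the use of Python's built-in hash function are all
hypothetical choices made for illustration. Only the eight-slot sets and the linked overflow store
are taken from the description above.

```python
from collections import defaultdict

SLOTS_PER_SET = 8  # from the text: a tag maps to a set of eight slots


class MatchingStore:
    """Toy model of a two-level waiting-matching store (illustrative only)."""

    def __init__(self, num_sets=4):
        # num_sets is an arbitrary small value chosen for the example.
        self.num_sets = num_sets
        # First level: each set holds up to eight (tag, value) entries.
        self.sets = [[] for _ in range(num_sets)]
        # Second level: overflow chains, one list per hash bucket.
        self.overflow = defaultdict(list)

    def _index(self, tag):
        # Assumption: any hash will do here; the real hash is not specified.
        return hash(tag) % self.num_sets

    def arrive(self, tag, value):
        """Process an incoming token.

        Returns (partner_value, value) if a token with the same tag was
        waiting; otherwise stores the token and returns None.
        """
        idx = self._index(tag)
        # The hardware compares the eight slots in parallel; we scan them,
        # then fall back to the overflow chain.
        for level in (self.sets[idx], self.overflow[idx]):
            for i, (stored_tag, stored_value) in enumerate(level):
                if stored_tag == tag:
                    del level[i]
                    return (stored_value, value)
        # No partner: wait in the first level if a slot is free,
        # otherwise spill into the overflow chain.
        if len(self.sets[idx]) < SLOTS_PER_SET:
            self.sets[idx].append((tag, value))
        else:
            self.overflow[idx].append((tag, value))
        return None
```

In this model a token either finds its partner and leaves with it, or waits. The overflow chain is
consulted only after the eight first-level slots have been examined, which mirrors the intent of
keeping the common case fast.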

The Manchester machine has no structure store per se. Instead, a host of exotic matching
operations are provided so that the matching store can function as a structure store as well [49]. The
analog of an invocation ID can be treated as an array descriptor, and the iteration ID can function as
the index, so a tag can represent an array element. A store operation generates a token which goes
to the matching unit and sticks there. A read operation generates a token which matches with an
element stuck in the store, extracts a copy of it, and forwards the copy to the destination of the read
operation, but leaves the sticky element in the store. If the read token fails to find a partner in the
store, it cycles through the ring, busy-waiting. When the structure is deallocated, its elements must
be purged from the store. This approach has not proved very successful. It increases the already
large load on the matching unit and communication network, degrades the performance of the
matching unit on standard operations, and makes its design much more complex. To resolve
these problems, the Manchester group is developing a structure store similar to the I-structure
store. Sticky tokens are also used for loop constants (discussed in Section 2.2). The iteration part of
the tag is ignored in performing the match and the sticky token remains in the store even when a
match is performed. Cleaning up the matching store when a loop terminates presents difficulties.
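
The cost of busy-waiting reads can be made concrete with a toy simulation of sticky tokens on a
single ring. This is a minimal sketch under our own assumptions: the function name run_ring, the
token encoding as tuples, and the max_steps bound are hypothetical and are not taken from the
Manchester design. A tag is modeled as an (array descriptor, index) pair, as in the text.

```python
from collections import deque


def run_ring(tokens, max_steps=100):
    """Simulate sticky-token structure access on a single ring.

    tokens: iterable of ("store", tag, value) or ("read", tag, dest).
    Returns (delivered, recirculations): the (dest, value) pairs produced
    by reads, and the number of times a read token had to go around the
    ring again because its element had not yet been stored.
    """
    ring = deque(tokens)
    sticky = {}            # tag -> value; entries persist after a match
    delivered = []
    recirculations = 0
    steps = 0
    # max_steps bounds the run in case a read never finds its element.
    while ring and steps < max_steps:
        steps += 1
        kind, tag, payload = ring.popleft()
        if kind == "store":
            sticky[tag] = payload          # the token sticks in the store
        elif tag in sticky:
            # Read: forward a copy, leave the sticky element in place.
            delivered.append((payload, sticky[tag]))
        else:
            # No partner yet: the read token cycles through the ring.
            ring.append((kind, tag, payload))
            recirculations += 1
    return delivered, recirculations
```

Each recirculation in this model corresponds to one extra pass of a read token through the
matching unit, which is the additional load described above.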

The  Manchester  machine  has  provided  a  target  for  a  number  of  dataflow  languages  and  has  run  a 
number  of  sizable  applications.  Extensions  to  multi-ring  machines  are  being  studied  through 
simulation.  Work  continues  in  areas  related  to  controlling  parallelism  and  instruction  set  design. 

4.2.2. Sigma-1 at Electrotechnical Laboratory, Japan

Under the auspices of the Japanese National Supercomputer Project, the Electrotechnical
Laboratory is developing a machine [50] based on the MIT tagged-token architecture. The current
proposal is to produce a prototype 32-bit machine capable of 100 Mflops by the end of 1986. The
individual processors are pipelined and operate on a 100 ns cycle. The network is packet-switched
and composed of 4x4 routers. The engineering effort involved in this project is substantial,
including the development of a 1-board PE and a 1-board structure memory. Together, these will
require eight to ten custom CMOS gate-array chips and a custom VLSI chip. The PE will contain
16k words of program memory, 8k words of token buffering, and 64k words of waiting-matching
store, and the structure memory 256k words. (The memory sizes may be increased by a factor of
four by the time the machine is built.) The machine will have up to 180 boards, divided roughly
half and half between the structure memory and ALU boards. A 6-board version of the PE has
been operational since November 1984.

A number of interesting design choices have been made in Sigma-1. A short-latency two-stage
processor pipeline is employed to execute code with low parallelism efficiently. In the first stage,
instruction fetch and matching are performed simultaneously. If the match fails, the fetched
instruction is discarded. In the second stage, destination tags are generated in parallel with the
ALU operation. Tokens are transferred through the network as 80-bit packets. Two cycles are
required to receive a packet, but the first stage of the processor pipeline operates on the first 40 bits
of the packet (the tag) while the second 40 bits are received. The waiting-matching store is
implemented as a chained hash table. The first operand of a pair is inserted in the matching store in
4 cycles; matching the second token of a pair has an expected time of 2.6 cycles. Sticky tokens are
employed for loop constants; however, the designers of the ETL machine have intimated that the
utility of this approach may not warrant the added complexity in the matching unit. The structure
controllers support deferred reads. Rather than support a general heap storage model, in which
data objects may have arbitrary lifetimes, structures are deleted when the procedure which created
the structure terminates. This simplifies storage management and is probably acceptable for writing
numerical applications, the intended application area for the machine.
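
Deferred reads stand in contrast to the busy-waiting read tokens described in Section 4.2.1: a read
that arrives before the corresponding write is queued at the memory location instead of being
recirculated. The following is a minimal sketch of that idea only, and is not a model of the Sigma-1
structure controller. The class and method names are hypothetical, and rejecting a second write to
the same location is our assumption, following the usual single-assignment view of I-structures.

```python
class IStructureMemory:
    """Toy structure memory with deferred reads (illustrative only)."""

    _EMPTY = object()  # sentinel marking a location not yet written

    def __init__(self, size):
        self.cells = [self._EMPTY] * size
        # deferred[i] holds the destinations of reads that arrived early.
        self.deferred = [[] for _ in range(size)]

    def read(self, index, dest):
        """Return [(dest, value)] if present, else defer and return []."""
        if self.cells[index] is self._EMPTY:
            self.deferred[index].append(dest)
            return []
        return [(dest, self.cells[index])]

    def write(self, index, value):
        """Store value and return responses for all deferred reads."""
        if self.cells[index] is not self._EMPTY:
            # Assumption: each location is written at most once.
            raise ValueError("location %d written twice" % index)
        self.cells[index] = value
        waiting, self.deferred[index] = self.deferred[index], []
        return [(dest, value) for dest in waiting]
```

In this sketch an early read generates no further traffic until the write arrives, at which point
every waiting destination receives its value.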

4.2.3.  The  MIT  Tagged-Token  Project 

Not  surprisingly,  the  tagged-token  machine  presented  in  Section  2.2  reflects  the  approach  of  the 
authors’ group at MIT. This machine developed through a sequence of stages [7, 30, 14, 13, 12, 4]
from theoretical work on the U-interpreter model [8, 9]. The MIT group has focused on developing
an  entire  dataflow  system,  rather  than  on  hardware  development  per  se.  Two  soft  prototypes  have 
been  implemented  to  serve  as  vehicles  for  studying  architectures,  program  development  and 
resource  management.  A  simulator  provides  a  detailed  model  of  the  machine,  including  internal 
timings,  while  a  dataflow  emulator  is  being  developed  to  run  on  the  Multiprocessor  Emulation 
Facility  [6]  (MEF),  to  study  dynamic  behavior  of  larger  applications.  The  MEF  is  a  collection  of 
Lisp machines (38 Texas Instruments Explorers and 8 Symbolics 3600s) which will be connected by
a  high  bandwidth  packet-switched  network  in  the  near  future.  Each  Lisp  machine  emulates  a 
dataflow  PE.  Both  the  simulator  and  emulator  execute  graphs  produced  by  our  compiler  from  the 
high-level  dataflow  language  Id  [10, 42].  A  number  of  reasonably  large  benchmarks  are  being 
studied  on  the  soft-prototypes  of  the  MIT  Tagged-Token  machine,  including  a  complex 
hydrodynamics and heat conduction code.

5.  Prognosis 

In  this  paper  we  have  outlined  two  salient  issues  in  dataflow  architectures:  token  storage 
mechanisms  and  data  structures,  and  surveyed  several  dataflow  machines.  We  have  not  attempted 
to  cover  all  the  current  research  topics;  for  the  interested  reader,  these  include:  demand-driven 
evaluation [43], controlled program unfolding and deadlock avoidance [17, 45, 5], efficient
procedure invocation, storage reclamation, relationships with parallel reduction
architectures [38, 18, 37], network design and topology, and semantics of programming languages
with I-structures. However, dataflow architectures are of more than academic interest, so in
conclusion  we  consider  their  potential  in  the  real  world. 

Today a vast collection of single-board computers are available which offer roughly 1 MIPS at
low cost: these are touted as building blocks for multiprocessors. Can dataflow machines compete?
It is not clear if a single dataflow processor can achieve the performance of a von Neumann
processor at the same hardware cost. The dataflow instruction-scheduling mechanism is clearly
more complex than incrementing a program counter. An engineering effort substantially beyond
any of the current dataflow projects is required to make a fair comparison. The Sigma-1 project is
an important step in this direction. The question becomes more interesting when we consider
machines with multiple processors, where the dataflow scheduling mechanism yields significant
benefits. In the basic von Neumann machine the processor issues a memory request and waits for
the result to be produced. The memory cycle time is invariably greater than the processor cycle
time, so computer architects devote tremendous effort to reduce the amount of waiting. This
problem  is  much  more  severe  in  a  multiprocessor  context  because  the  time  to  process  a  memory 
request  is  generally  much  greater  than  in  a  single  processor  and  is  unpredictable.  Further,  most 
traditional  techniques  for  reducing  the  effects  of  memory  latency  do  not  work  well  in  a 
multiprocessor  setting.  The  dataflow  approach  can  be  viewed  as  an  extreme  solution  to  the  memory 
latency  problem  --  the  processor  never  waits  for  responses  from  memory;  it  continues  processing 
other  instructions.  Instructions  are  scheduled  based  on  the  availability  of  data,  so  memory 
responses  are  simply  routed  along  with  the  tokens  produced  by  processors.  Thus,  even  if  individual 
dataflow  processors  do  not  yield  the  performance  per  dollar  of  a  conventional  processor,  we  can 
expect  them  to  be  better  utilized  than  a  conventional  processor  in  a  multiprocessor  setting.  For 
large  enough  collections  of  processors  they  should  be  cost  effective  as  well  as  show  absolute 
performance  not  achievable  by  conventional  processors.  But  it  is  not  yet  clear  where  this  threshold 
lies. 
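
The utilization argument can be illustrated with a deliberately simple model. The sketch below is
our own toy construction and does not model any machine discussed in this paper: the function
name, the unit-time instruction, and the fixed memory latency are all assumptions. Each of several
independent activities alternates one cycle of computation with a memory request. A blocking
processor idles until each response returns, whereas a processor that schedules on data availability
runs any other activity whose response has already arrived.

```python
import heapq
from collections import deque


def utilization(tasks, rounds, latency, blocking):
    """Fraction of cycles the processor spends executing instructions.

    Each of `tasks` independent activities performs `rounds` steps; a step
    is one cycle of computation followed by a memory request that takes
    `latency` cycles to return.
    """
    work = tasks * rounds                  # total busy cycles
    if blocking:
        # The processor idles for the full latency after every request.
        return work / (work * (1 + latency))
    ready = deque((t, rounds) for t in range(tasks))
    pending = []                           # heap of (return_time, task, steps_left)
    now = 0
    while ready or pending:
        # Responses that have arrived re-enable their activities.
        while pending and pending[0][0] <= now:
            _, t, left = heapq.heappop(pending)
            if left > 0:
                ready.append((t, left))
        if ready:
            t, left = ready.popleft()
            now += 1                       # execute one instruction
            heapq.heappush(pending, (now + latency, t, left - 1))
        elif pending:
            now = pending[0][0]            # nothing enabled: idle until a response
    return work / now
```

With a single activity the two processors behave identically, since there is nothing else to run
while the request is outstanding. The benefit appears only when enough independent activities are
available to cover the latency, which is why the argument above applies to programs with ample
parallelism.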

The preceding discussion suggests that dataflow machines are likely to be competitive in the
high-performance range; however, we would not make such a claim lightly. It is unlikely that a large
collection of 1 MIPS machines of any ilk will compete with a few very high performance processors,
i.e., processors which can perform 10 to 100 MFLOPs each. To compete among supercomputers, it
may  be  necessary  to  engineer  a  dataflow  machine  with  the  technology  and  finesse  employed  in 
conventional  supercomputers.  This  is  a  major  undertaking,  far  beyond  any  of  the  dataflow  projects 
currently  proposed.  Most  supercomputers  include  vector  accelerators  to  improve  performance  on  a 
restricted  class  of  programs.  It  remains  to  be  seen  how  effective  these  will  be  in  a  multiprocessor 
context  and  the  extent  to  which  analogous  accelerators  will  be  needed  for  dataflow  machines. 

This  paper  has  focused  on  architectural  issues,  and  accordingly  has  scarcely  touched  on  the  high- 
level  programming  model  which  accompanies  dataflow  machines.  Nonetheless,  programmability 
of  parallel  machines  is  critical.  Conventional  programming  languages  are  imperative  and  sequential 
in  nature:  do  this,  then  do  that,  etc.  Efforts  to  use  these  languages  for  describing  parallel 
computation  have  been  ad  hoc  and  unwieldy,  greatly  increasing  the  difficulty  of  the  already  onerous 
programming  task.  The  programmer  must  determine  what  synchronization  is  required  to  avoid 
read-write races. Even so, subtle timing bugs are common. A class of languages, called functional
languages, completely avoid these synchronization problems by disallowing "updatable" variables.
Functional languages employ function composition, rather than command sequencing, as the basic
concept and can be translated into dataflow graphs easily, exposing parallelism. These languages
can be augmented with I-structures to make data structures more efficient, without sacrificing
determinacy or parallelism. It is our belief that dataflow architectures together with these new
languages will show the programming generality, performance and cost effectiveness needed to
make  parallel  machines  widely  applicable. 


We gratefully acknowledge Robert Iannucci's drawings of various dataflow architectures, from
which we have "borrowed" liberally. Many ideas in this paper derive from the common heritage of
the Computation Structures Group at M.I.T. Laboratory for Computer Science, and we are
indebted to its members for providing a stimulating research environment. We are grateful to
Steven Brobst, Jack Dennis, K. Ekanadham, Bhaskar Guharoy, Gino Maa, Hitoshi Nohmi, Greg
Papadopoulos, Natalie Tarbet, and Ken Traub for their valuable comments on drafts of this paper.
Of  course,  we  take  responsibility  for  the  opinions  presented  and  any  remaining  errors. 